Reconfigure CI pipelines for PR and merge queue #11526
Conversation
This commit refactors the pipeline organisation of our CI to reduce the total number of jobs run on any given PR merge, and to reduce the critical path length of merging. In particular:

- the lint, docs and QPY tests are combined into one job for a single CI worker. These three jobs individually add up to less than a test run, and moving QPY from the first-stage PR test run to this job rebalances the (now) two jobs in the stage to be more equal in runtime.
- a new pipeline is added specifically for the merge queue. Previously this reused the same two-stage PR pipeline. The two-stage system unnecessarily lengthened the critical path: when a PR enters the merge queue it has already passed PR CI, and so is highly likely to pass all jobs. In addition to flattening to a single stage, one macOS job and one Windows job are removed, to lower the number of VMs needed and to reduce the chance of a timeout (these OSes are more likely than Linux to get a dodgy VM and time out). The only way a new test failure should appear (other than a flaky test) is via a logical merge conflict, which would be quite unlikely to affect only a particular Python version.

To make it easier to run the lint and docs jobs together, and to ensure that both run even if one fails, the full lint configuration is merged back into `tox.ini`, and that is used to do the linting. This makes it more consistent for developers as well.
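A merge-queue-only pipeline of the shape described above could be sketched roughly as follows; the trigger pattern, stage name, and template paths here are illustrative assumptions, not the repository's exact configuration:

```yaml
# Hypothetical sketch of a single-stage merge-queue pipeline.
# GitHub's merge queue pushes to branches named gh-readonly-queue/*,
# so triggering on that pattern runs this pipeline only for the queue.
trigger:
  branches:
    include:
      - gh-readonly-queue/*
pr: none

stages:
  - stage: MergeQueue        # single flat stage: no two-stage critical path
    jobs:
      - template: .azure/lint_docs_qpy.yml   # combined lint + docs + QPY job
      - template: .azure/test-linux.yml      # one job per supported Python
      - template: .azure/test-macos.yml      # single macOS job (one removed)
      - template: .azure/test-windows.yml    # single Windows job (one removed)
```

Because every job sits in one stage, the queue's wall-clock time is bounded by the slowest single job rather than the sum of two stages.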
0ddb999 to e849c51
Pull Request Test Coverage Report for Build 7548743872

Warning: this coverage report may be inaccurate. We've detected an issue with your CI configuration that might affect the accuracy of this pull request's coverage report.
💛 - Coveralls
The lint job will fail unless the Rust components are installed into the source directory, since that's where `pylint` will look for them to resolve the imports.
Overall this LGTM, just one question inline
      # Clean up Sphinx detritus.
      rm -rf docs/_build/html/{.doctrees,.buildinfo}
  displayName: 'Run Docs build'
- bash: tox run -e docs,lint
Oh, good idea to leverage tox for lint too. Saves a lot of config logic here.
laziness is often a good motivator for single-source-of-truth configuration lol
# `pylint` will examine the source code, not the version that would otherwise be
# installed in `site-packages`, so we use an editable install to make sure the
# compiled modules are built into a valid place for it to find them.
package = editable
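For context, a minimal `tox.ini` environment along these lines might look like the following sketch; the environment name, dependency file, and lint commands are illustrative assumptions rather than the repository's exact configuration:

```ini
# Hypothetical sketch of a lint environment built on an editable install.
# With tox 4, `package = editable` installs the project via `pip install -e`,
# so the compiled Rust modules end up next to the Python sources, where
# `pylint` resolves imports from.
[testenv:lint]
package = editable
deps =
    -r requirements-dev.txt    # assumed dev-requirements file
commands =
    black --check qiskit test tools
    pylint -rn qiskit test tools
```

Running it alongside docs as in the diff above (`tox run -e docs,lint`) lets tox execute both environments even if one of them fails.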
Maybe we should do this globally in the tox config. I'm constantly hitting this locally when using tox, which is why I pushed #11380.
Let's merge #11380 in that separate PR after this, maybe? To me, this change is required to fix the lint job (at the moment, `tox -e lint` only works if you happen to already have built the Rust extensions in place yourself, hence the CI failure earlier), but #11380 isn't strictly needed: it's a change to the config that deliberately overrides some of tox's isolation as a trade-off for better re-test performance.
Ha, the templating of the merge-queue CI is bust - it crashed when added to the queue.
Dunno how I borked that - I'd assumed I'd copy-pasted the condition, but apparently not. Hopefully better now.
* Reconfigure CI pipelines for PR and merge queue
* Allow cargo as an external in 'tox'
* Add display name to lint job
* Use editable install for tox lint job: the lint job will fail unless the Rust components are installed into the source directory, since that's where `pylint` will look for them to resolve the imports.
* Fix typo in merge-queue stage condition

(cherry picked from commit ed79d42)

# Conflicts:
#   .azure/lint-linux.yml
#   tox.ini
@Mergifyio backport stable/0.46
✅ Backports have been created
…1587) * Reconfigure CI pipelines for PR and merge queue (#11526)
* Fix typo in merge-queue stage condition
(cherry picked from commit ed79d42)
* Fix conflict

Co-authored-by: Jake Lishman <[email protected]>
…1586) * Reconfigure CI pipelines for PR and merge queue (#11526)
* Fix typo in merge-queue stage condition
(cherry picked from commit ed79d42)
* Fix conflict

Co-authored-by: Jake Lishman <[email protected]>
This commit has two major goals:

- fix the caching of the QPY files for both the `main` and `stable/*` branches
- increase the number of compatibility tests between the different symengine versions that might be involved in the generation and loading of the QPY files.

Achieving both of these goals also means that it is sensible to move the job to GitHub Actions at the same time, since it will put more pressure on the Azure machine concurrency we use.

Caching
-------

The previous QPY tests attempted to cache the generated files for each historical version of Qiskit, but this was unreliable. The cache never seemed to hit on backport branches, which was a huge slowdown in the critical path to getting releases out. The cache restore keys were also a bit lax, meaning that we might accidentally have invalidated files in the cache by changing what we wanted to test, but the restore keys wouldn't have changed.

The cache files would fail to restore as a side-effect of ed79d42 (Qiskitgh-11526): QPY was moved to the tail end of the lint run, rather than a test run, which meant it was no longer run as part of the push event when updating `main` or one of the `stable/*` branches. In Azure (and GitHub Actions), the "cache" action accesses a _scoped_ cache, not a universal one for the repository [^1][^2]. Approximately, base branches each have their own scope, and PR events open a new scope that is a child of the target branch, the default branch, and the source branch, if appropriate. A cache task can read from any of its parent scopes, but write events go to the most local scope. This means that we haven't been writing to long-standing caches for some time now: PRs would typically miss the cache on the first attempt, hit their cache for updates, then miss again once entering the merge queue.

The fix for this is to run the QPY job on branch-update events as well. The post-job cache action will then write out to a reachable cache for all following events.
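In GitHub Actions terms, the fix described above amounts to triggering the workflow on push events too, so that the post-job cache write lands in the base branch's scope where later PR and merge-queue runs can read it. A sketch, in which the workflow name, cache path, and key file are illustrative assumptions:

```yaml
# Hypothetical sketch: running on push makes the cache write go to the
# base branch's cache scope, which PR and merge-queue runs can read from.
on:
  push:
    branches: [main, 'stable/*']
  pull_request:

jobs:
  qpy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/cache@v4
        with:
          path: qpy-files/
          # Exact key changes whenever the set of tested versions changes,
          # so stale files cannot be restored under a matching key.
          key: qpy-${{ hashFiles('test/qpy_compat/versions.txt') }}
          restore-keys: qpy-
```

The `restore-keys` prefix still allows a partial restore from an older cache in a parent scope when the exact key misses.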
Cross-symengine tests
---------------------

We were previously running just a single test with differing versions of symengine between the loading and generation of the QPY files. This refactors the QPY `run_tests.sh` script to run a full pairwise matrix of compatibility tests, to increase the coverage.

[^1]: https://docs.github.com/en/actions/writing-workflows/choosing-what-your-workflow-does/caching-dependencies-to-speed-up-workflows#restrictions-for-accessing-a-cache
[^2]: https://learn.microsoft.com/en-us/azure/devops/pipelines/release/caching?view=azure-devops#cache-isolation-and-security
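The pairwise matrix can be sketched as a nested loop over the tested versions; the version numbers and echoed commands below are illustrative stand-ins for what the real `run_tests.sh` does, not its actual contents:

```shell
# Hypothetical sketch of a pairwise compatibility loop: every version
# generates QPY files once, and every version attempts to load each set.
versions="0.9.2 0.11.0 0.13.0"
pairs=0
for generate in $versions; do
  for load in $versions; do
    # The real job would install $generate, generate the QPY files, then
    # install $load and load them back, failing on any incompatibility.
    echo "generate=$generate load=$load"
    pairs=$((pairs + 1))
  done
done
echo "total pairs: $pairs"
```

For n versions this runs n² generate/load combinations, so coverage grows quadratically with the number of symengine releases under test.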
…13273) (#13380)
(cherry picked from commit af8be25)

Co-authored-by: Jake Lishman <[email protected]>
Summary
This commit refactors the pipeline organisation of our CI to reduce the total number of jobs run on any given PR merge, and to reduce the critical path length of merging. In particular:

- the lint, docs and QPY tests are combined into one job for a single CI worker. These three jobs individually add up to less than a test run, and moving QPY from the first-stage PR test run to this job rebalances the (now) two jobs in the stage to be more equal in runtime.
- a new pipeline is added specifically for the merge queue. Previously this reused the same two-stage PR pipeline. The two-stage system unnecessarily lengthened the critical path: when a PR enters the merge queue it has already passed PR CI, and so is highly likely to pass all jobs. In addition to flattening to a single stage, one macOS job and one Windows job are removed, to lower the number of VMs needed and to reduce the chance of a timeout (these OSes are more likely than Linux to get a dodgy VM and time out). The only way a new test failure should appear (other than a flaky test) is via a logical merge conflict, which would be quite unlikely to affect only a particular Python version.

To make it easier to run the lint and docs jobs together, and to ensure that both run even if one fails, the full lint configuration is merged back into `tox.ini`, and that is used to do the linting. This makes it more consistent for developers as well.

Details and comments
This should automatically already work with the branch-protection rules, but there's like a 99% chance I've messed something up in the Azure configuration on the first attempt, so let's see if CI passes.